IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

Guo, Wenxuan, Xu, Xiuwei, Yin, Hang, Wang, Ziwei, Feng, Jianjiang, Zhou, Jie, Lu, Jiwen

arXiv.org Artificial Intelligence

Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods rely either on end-to-end reinforcement learning or on modular policies with a topological graph or BEV map as memory, neither of which can fully model the geometric relationship between the explored 3D environment and the goal image. To localize the goal image in 3D space efficiently and accurately, we build our navigation system on the renderable 3D Gaussian Splatting (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during the agent's exploration is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive using feed-forward monocular prediction. We then coarsely localize the goal by leveraging geometric information for discrete-space matching, which is equivalent to an efficient 3D convolution. When the agent is close to the goal, we solve for the fine target pose by optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It also handles the more challenging free-view image-goal setting and can be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
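The coarse localization step described above, matching the goal image's features against a discretized scene, can be pictured as scoring every cell of a 3D feature grid against a goal descriptor. The following is a minimal numpy sketch of that idea, not the authors' implementation: `coarse_localize`, the grid shapes, and the dot-product score are all illustrative assumptions, and the per-voxel dot product is what makes the operation equivalent to a 1x1x1 3D convolution with the goal descriptor as the kernel.

```python
import numpy as np

def coarse_localize(scene_feats, goal_feat):
    """scene_feats: (D, H, W, C) per-voxel features; goal_feat: (C,) descriptor.
    Returns the (d, h, w) index of the best-matching voxel."""
    # Dot-product similarity at every voxel == a 3D conv with a 1x1x1 kernel.
    scores = np.tensordot(scene_feats, goal_feat, axes=([3], [0]))  # (D, H, W)
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in idx)

# Toy usage: plant the goal descriptor at a known voxel and recover it.
rng = np.random.default_rng(0)
scene = rng.normal(size=(4, 4, 4, 8))
goal = np.zeros(8)
goal[0] = 10.0
scene[2, 1, 3] = goal  # embed a strong match at a known location
print(coarse_localize(scene, goal))  # -> (2, 1, 3)
```

In the real system the "kernel" would be a learned goal embedding and the grid an incrementally built 3DGS scene; the sketch only shows why discrete matching can reuse convolution machinery.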


Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation

Deng, Yijie, Yuan, Shuaihang, Bethala, Geeta Chandra Raju, Tzes, Anthony, Liu, Yu-Shen, Fang, Yi

arXiv.org Artificial Intelligence

Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions with high semantic similarity to the target object class, with fine-grained local geometric scoring that performs precise pose estimation within promising regions. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on simulated IIN benchmarks and shows real-world applicability.
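The hierarchy described above amounts to a cheap score pruning candidates before an expensive score is computed. A small illustrative sketch, with `semantic_score` and `geometric_score` as stand-ins for the CLIP relevancy field and the local pose-estimation check (none of these names come from the paper):

```python
import numpy as np

def hierarchical_select(viewpoints, semantic_score, geometric_score, top_k=3):
    # Level 1: rank all viewpoints by the cheap semantic score.
    sem = np.array([semantic_score(v) for v in viewpoints])
    candidates = np.argsort(sem)[::-1][:top_k]
    # Level 2: run the expensive geometric check on the top-k survivors only.
    geo = {i: geometric_score(viewpoints[i]) for i in candidates}
    return max(geo, key=geo.get)

views = [0.0, 1.0, 2.0, 3.0, 4.0]            # stand-in viewpoint parameters
best = hierarchical_select(
    views,
    semantic_score=lambda v: -abs(v - 3.0),  # coarse score peaks near v = 3
    geometric_score=lambda v: -abs(v - 2.0), # fine check prefers v = 2
)
print(views[best])  # -> 2.0
```

The point of the structure is cost: the geometric score runs only `top_k` times instead of once per viewpoint, which is the redundancy the paper's paradigm is designed to avoid.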


Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement

Feng, Zhicheng, Chen, Xieyuanli, Shi, Chenghao, Luo, Lun, Chen, Zhichao, Liu, Yun-Hui, Lu, Huimin

arXiv.org Artificial Intelligence

In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX 3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
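To make the data flow of spatial-channel attention concrete, here is a minimal numpy sketch: a channel gate computed from globally pooled responses followed by a per-pixel spatial gate, applied to additively fused goal and observation features. The real RFSG module is learned; the fixed pooling-based gates here are assumptions, purely to illustrate how the two attention axes compose.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_fuse(goal_feat, obs_feat):
    """goal_feat, obs_feat: (C, H, W). Returns fused (C, H, W) features."""
    fused = goal_feat + obs_feat                  # simple additive fusion
    # Channel attention: gate each channel by its global average response.
    ch_gate = sigmoid(fused.mean(axis=(1, 2)))    # (C,)
    fused = fused * ch_gate[:, None, None]
    # Spatial attention: gate each location by its mean across channels.
    sp_gate = sigmoid(fused.mean(axis=0))         # (H, W)
    return fused * sp_gate[None, :, :]

out = spatial_channel_fuse(np.ones((8, 4, 4)), np.ones((8, 4, 4)))
print(out.shape)  # -> (8, 4, 4)
```

A learned version would replace the pooled means with small MLPs or convolutions producing the gates, but the shapes and the channel-then-spatial ordering would look the same.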


FGPrompt: Fine-grained Goal Prompting for Image-goal Navigation

Neural Information Processing Systems

Learning to navigate to an image-specified goal is an important but challenging task for autonomous systems like household robots. The agent is required to understand and reason about the location of the navigation goal from a picture taken at the goal position. Existing methods try to solve this problem by learning a navigation policy that captures semantic features of the goal image and the observation image independently and finally fuses them to predict a sequence of navigation actions. However, these methods suffer from two major limitations. In this paper, we aim to overcome these limitations by designing a Fine-grained Goal Prompting (FGPrompt) method for image-goal navigation. In particular, we leverage fine-grained and high-resolution feature maps of the goal image as prompts to perform conditioned embedding, which preserves detailed information in the goal image and guides the observation encoder to pay attention to goal-relevant regions.
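Conditioned embedding of the kind the abstract describes can be sketched in a FiLM-like form: the goal features predict per-channel scale and shift terms that modulate the observation features. This is an illustrative assumption about the mechanism, not FGPrompt's actual architecture; `condition_on_goal` and the random weight matrices stand in for the learned conditioning layers.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 16  # number of feature channels (illustrative)

def condition_on_goal(obs_feat, goal_feat, w_scale, w_shift):
    """obs_feat, goal_feat: (C, H, W). Returns goal-conditioned obs features."""
    g = goal_feat.mean(axis=(1, 2))       # (C,) pooled goal descriptor
    scale = 1.0 + w_scale @ g             # (C,) per-channel scale
    shift = w_shift @ g                   # (C,) per-channel shift
    # Modulate every spatial location of the observation features.
    return obs_feat * scale[:, None, None] + shift[:, None, None]

obs = rng.normal(size=(C, 8, 8))
goal = rng.normal(size=(C, 8, 8))
w_scale = 0.1 * rng.normal(size=(C, C))  # stand-ins for learned weights
w_shift = 0.1 * rng.normal(size=(C, C))
out = condition_on_goal(obs, goal, w_scale, w_shift)
print(out.shape)  # -> (16, 8, 8)
```

The key property the abstract emphasizes, that high-resolution goal detail steers the observation encoder, would correspond to conditioning at multiple encoder stages rather than the single pooled modulation shown here.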


Transformers for Image-Goal Navigation

Pelluri, Nikhilanj

arXiv.org Artificial Intelligence

Autonomous navigation is a critical capability for modern mobile robots and has been extensively studied over several decades. Classical approaches to navigation rely on constructing detailed maps of the environment and accurately localizing the robot within the map [1, 2, 3]. However, with increasing demand for deploying robots in novel uncontrolled environments such as households, last-mile delivery, etc., constructing accurate and fine-grained maps is often impractical. Robots must now be able to navigate without maps, which means efficient navigation policies require accurate semantic understanding of the scene, efficient exploration and episodic memory, and long-horizon planning with limited knowledge of the environment. Advances in scene understanding have led to semantic navigation tasks such as image-goal navigation [4, 5] and object-goal navigation [6, 7] receiving significant focus in recent years. In this work, we consider the specific task of image-goal navigation, where the robot's navigation objective is specified by an RGB image. We motivate the task with the following scenario: consider a mobile household robot equipped with an onboard camera, tasked with picking up a novel unseen object (say, a new shirt). Since the robot has no prior knowledge about the novel object, it would need other semantic information to understand the object - an image of the object serves this purpose effectively.


Last-Mile Embodied Visual Navigation

Wasserman, Justin, Yadav, Karmesh, Chowdhary, Girish, Gupta, Abhinav, Jain, Unnat

arXiv.org Artificial Intelligence

Realistic long-horizon tasks like image-goal navigation involve exploratory and exploitative phases. Assigned with an image of the goal, an embodied agent must explore to discover the goal, i.e., search efficiently using learned priors. Once the goal is discovered, the agent must accurately calibrate the last-mile of navigation to the goal. As with any robust system, switches between exploratory goal discovery and exploitative last-mile navigation enable better recovery from errors. Following these intuitive guide rails, we propose SLING to improve the performance of existing image-goal navigation systems. Entirely complementing prior methods, we focus on last-mile navigation and leverage the underlying geometric structure of the problem with neural descriptors. With simple but effective switches, we can easily connect SLING with heuristic, reinforcement learning, and neural modular policies. On a standardized image-goal navigation benchmark (Hahn et al. 2021), we improve performance across policies, scenes, and episode complexity, raising the state-of-the-art from 45% to 55% success rate. Beyond photorealistic simulation, we conduct real-robot experiments in three physical scenes and find these improvements to transfer well to real environments.
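The explore/exploit switch at the heart of SLING can be illustrated with a toy detection signal: count mutual nearest-neighbor matches between descriptors of the current view and the goal image, and hand control to the last-mile policy once enough matches appear. This is a hedged sketch under assumed names (`match_keypoints`, `select_phase`, the thresholds); the actual system uses learned neural descriptors and estimates a relative pose from the correspondences rather than just counting them.

```python
import numpy as np

def match_keypoints(obs_desc, goal_desc, thresh=0.9):
    """Count mutual nearest-neighbor matches between (N, D) descriptor sets."""
    sim = obs_desc @ goal_desc.T          # cosine-style similarity scores
    fwd = sim.argmax(axis=1)              # best goal match for each obs point
    bwd = sim.argmax(axis=0)              # best obs match for each goal point
    mutual = [i for i in range(len(obs_desc))
              if bwd[fwd[i]] == i and sim[i, fwd[i]] > thresh]
    return len(mutual)

def select_phase(obs_desc, goal_desc, min_matches=4):
    # Switch to last-mile navigation only once the goal is confidently seen.
    n = match_keypoints(obs_desc, goal_desc)
    return "last_mile" if n >= min_matches else "explore"

desc = np.eye(8)  # identical views: every descriptor matches its twin
print(select_phase(desc, desc))                            # -> last_mile
print(select_phase(desc, np.roll(desc, 1, axis=1) * 0.5))  # -> explore
```

The simple, explicit switch is what lets a module like this wrap around heuristic, reinforcement learning, or modular exploration policies without modifying them.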